251 research outputs found

    Descomposiciones ortogonales para el cálculo del rango numérico matricial (Orthogonal decompositions for computing the numerical rank of a matrix)

    Computing the numerical rank of a matrix arises in numerous applications in science and engineering. There are currently three basic numerical approaches to this computation: the SVD decomposition, the URV decomposition, and rank-revealing QR decompositions (QRRR). This work experimentally analyzes several sequential algorithms based on these three approaches. The comparative study uses the authors' own implementation of the URV decomposition and two new routines for the QRRR decomposition, together with the LAPACK library routines for the SVD decomposition and for the QR decomposition with column pivoting. The experimental results show that the QRRR decomposition is, in practice, as reliable as the costly SVD and URV decompositions, and that it has the fundamental advantage of a low computational cost. Peer reviewed.
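
As a rough illustration of the rank-revealing idea, the following minimal sketch estimates the numerical rank with a column-pivoted orthogonalization (modified Gram-Schmidt with pivoting) in plain Python. It is not the authors' URV or QRRR routines, nor a LAPACK routine; the function name `numerical_rank_qrcp` and the relative tolerance `tol` are assumptions made for this example.

```python
import math


def numerical_rank_qrcp(a, tol=1e-10):
    """Estimate the numerical rank of `a` (a list of rows).

    Illustrative sketch: modified Gram-Schmidt with column pivoting.
    `tol` is a hypothetical relative threshold on the pivot norms.
    """
    m, n = len(a), len(a[0])
    # Store the matrix by columns; copy so the caller's data is untouched.
    cols = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    first_norm = None
    rank = 0
    for k in range(min(m, n)):
        # Pivot: move the remaining column of largest norm to position k.
        norms = [math.sqrt(sum(x * x for x in c)) for c in cols[k:]]
        p = max(range(len(norms)), key=norms.__getitem__)
        pivot_norm = norms[p]
        cols[k], cols[k + p] = cols[k + p], cols[k]
        if first_norm is None:
            first_norm = pivot_norm
        # Stop when the pivot is negligible relative to the first one.
        if first_norm == 0.0 or pivot_norm <= tol * first_norm:
            break
        rank += 1
        q = [x / pivot_norm for x in cols[k]]
        # Remove the component along q from every trailing column.
        for c in cols[k + 1:]:
            r = sum(qi * ci for qi, ci in zip(q, c))
            for i in range(m):
                c[i] -= r * q[i]
    return rank
```

Column pivoting alone is known to misjudge the rank on some specially constructed matrices, which is one reason more careful rank-revealing variants such as those compared in the abstract exist.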

    Efficient Numerical Algorithms for Balanced Stochastic Truncation

    We propose an efficient numerical algorithm for relative error model reduction based on balanced stochastic truncation. The method uses full-rank factors of the Gramians to be balanced against each other and exploits the fact that, for large-scale systems, these Gramians are often of low numerical rank. We use the easy-to-parallelize sign function method as the major computational tool in determining these full-rank factors and demonstrate the numerical performance of the suggested implementation of balanced stochastic truncation model reduction.
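
For readers unfamiliar with the matrix sign function, the sketch below shows the classical Newton iteration X_{k+1} = (X_k + X_k^{-1}) / 2 in plain Python. It only illustrates the basic iteration; how the paper applies it to obtain full-rank Gramian factors is not reproduced here. The helper names `mat_inv` and `matrix_sign`, the tolerance and the iteration cap are assumptions for this example.

```python
def mat_inv(a):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting (illustrative helper, not tuned for speed)."""
    n = len(a)
    # Augment each row of `a` with the matching row of the identity.
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[piv][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [x / d for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matrix_sign(a, tol=1e-12, max_iter=100):
    """Newton iteration for sign(A); A must have no eigenvalues
    on the imaginary axis. `tol` and `max_iter` are hypothetical defaults."""
    x = [[float(v) for v in row] for row in a]
    n = len(x)
    for _ in range(max_iter):
        xi = mat_inv(x)
        nxt = [[0.5 * (x[i][j] + xi[i][j]) for j in range(n)] for i in range(n)]
        diff = max(abs(nxt[i][j] - x[i][j]) for i in range(n) for j in range(n))
        x = nxt
        if diff <= tol:
            break
    return x
```

Each step is dominated by a matrix inversion and an addition, operations with well-known parallel implementations, which is consistent with the abstract's description of the method as easy to parallelize.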

    Applying OOC Techniques in the Reduction to Condensed Form for Very Large Symmetric Eigenproblems on GPUs

    In this paper we address the reduction of a dense matrix to tridiagonal form for the solution of symmetric eigenvalue problems on a graphics processor (GPU) when the data is too large to fit into the accelerator memory. We apply out-of-core techniques to a three-stage algorithm, carefully redesigning the first stage to reduce the number of data transfers between the CPU and GPU memory spaces, keep the memory requirements on the GPU within limits, and ensure high performance through a high ratio of computation to communication.

    Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors

    We present a novel method for the QR factorization of large tall-and-skinny matrices that introduces an approximation technique for computing the Householder vectors. This approach is very competitive on a hybrid platform equipped with a graphics processor, with a performance advantage over the conventional factorization due to the reduced amount of data transfers between the graphics accelerator and the main memory of the host. Our experiments show that, for tall-skinny matrices, the new approach outperforms the code in MAGMA by a large margin, while it is very competitive for square matrices when the memory transfers and CPU computations are the bottleneck of the Householder QR factorization.
    This research was supported by the Project TIN2017-82972-R from the MINECO (Spain) and the EU H2020 Project 732631 "OPRECOMP. Open Transprecision Computing".
    Tomás Domínguez, AE.; Quintana-Ortí, ES. (2020). Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors. The Journal of Supercomputing (Online). 76(11):8771-8786. https://doi.org/10.1007/s11227-020-03176-3
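
As background for the entry above, this minimal sketch shows the conventional (exact) Householder QR factorization of a tall-and-skinny matrix in plain Python, returning only the triangular factor R. It is the baseline the abstract compares against, not the approximate-reflector method of the paper, whose details the abstract does not give. The function name `householder_qr_r` is an assumption for this example.

```python
import math


def householder_qr_r(a):
    """Return the n x n upper-triangular factor R of an m x n matrix
    (m >= n, list of rows) using conventional Householder reflectors."""
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    for k in range(n):
        # Column k from the diagonal down.
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(v * v for v in x))
        if norm_x == 0.0:
            continue  # nothing to annihilate in this column
        # Choose the sign that avoids cancellation in v[0].
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # Apply H = I - 2 v v^T / (v^T v) to columns k..n-1.
        for j in range(k, n):
            dot = sum(v[i] * r[k + i][j] for i in range(m - k))
            scale = 2.0 * dot / vtv
            for i in range(m - k):
                r[k + i][j] -= scale * v[i]
        # Entries below the diagonal are zero up to rounding; set them exactly.
        for i in range(k + 1, m):
            r[i][k] = 0.0
    return [row[:] for row in r[:n]]
```

Since Q is orthogonal, a quick sanity check is that R^T R matches A^T A up to rounding error.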

    Toward matrix multiplication for deep learning inference on the Xilinx Versal

    The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address this scenario by demonstrating that the principles underlying the modern realization of the general matrix multiplication (GEMM) in conventional processor architectures are also valid to achieve high performance for the type of operations that arise in deep learning (DL) on an exotic accelerator such as the AI Engine (AIE) tile embedded in Xilinx Versal platforms. In particular, our experimental results show that a prototype implementation of the GEMM kernel on a Xilinx Versal VCK190 delivers performance close to 86.7% of the theoretical peak that can be expected on an AIE tile, for 16-bit integer operands. Comment: 11 pages.
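
The loop structure commonly used by modern GEMM realizations on conventional processors can be sketched as follows: three outer loops partition the operands into cache-sized blocks, two inner loops walk over small register-sized tiles, and a micro-kernel updates one tile of C. This plain-Python sketch only shows the loop nest; the block sizes `mc`, `kc`, `nc`, `mr`, `nr` and the function names are assumptions, packing of A and B into contiguous buffers is omitted, and nothing here is specific to the AIE tile.

```python
def micro_kernel(a, b, c, i0, i1, j0, j1, p0, p1):
    """Update the tile C[i0:i1, j0:j1] += A[i0:i1, p0:p1] * B[p0:p1, j0:j1]
    as a sequence of rank-1 updates."""
    for p in range(p0, p1):
        for i in range(i0, i1):
            aip = a[i][p]
            for j in range(j0, j1):
                c[i][j] += aip * b[p][j]


def gemm_blocked(a, b, c, mc=4, kc=4, nc=4, mr=2, nr=2):
    """Compute C += A * B (A is m x k, B is k x n, C is m x n; lists of rows).

    Block sizes are hypothetical; real libraries tune them to the
    cache hierarchy and register file of the target processor.
    """
    m, k, n = len(a), len(b), len(b[0])
    for jc in range(0, n, nc):              # column panel of B and C
        j_end = min(jc + nc, n)
        for pc in range(0, k, kc):          # slice of the inner dimension
            p_end = min(pc + kc, k)
            for ic in range(0, m, mc):      # row block of A and C
                i_end = min(ic + mc, m)
                for jr in range(jc, j_end, nr):      # tiles inside the block
                    for ir in range(ic, i_end, mr):
                        micro_kernel(a, b, c,
                                     ir, min(ir + mr, i_end),
                                     jr, min(jr + nr, j_end),
                                     pc, p_end)
    return c
```

Keeping the architecture-specific work inside the micro-kernel is what makes this structure portable: moving to a different processor mainly means rewriting that kernel and retuning the block sizes.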

    Energy aware execution environments and algorithms on low power multi-core architectures

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016. Energy consumption is a key factor that conditions both the proper functioning of today's data centers and high performance computing facilities and the launch of new services, because of its negative environmental impact and the increasing economic cost of energy. The energy efficiency of the applications used in these data centers could be improved, especially when the systems' utilization rate is low or moderate, or when targeting memory-bound applications. In this sense, energy proportionality refers to systems whose power consumption is in line with the amount of work performed at each moment. In response to these needs, the main objective of this project is to study, design, develop and analyze experimental solutions (models, programs, tools and techniques) aware of energy proportionality for scientific and engineering applications on low-power architectures. To show the benefits of this contribution, two applications, from the fields of image processing and molecular dynamics simulation, have been chosen. European Cooperation in Science and Technology (COST).

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
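
One simple way to picture an asymmetric static schedule is to split a loop's iteration space between the two clusters in proportion to their assumed relative throughput, and then evenly among the cores of each cluster. The sketch below does exactly that in plain Python. The function names, the core counts passed in, and the `big_speedup` ratio are all assumptions for illustration; the abstract does not specify the paper's actual partitioning rule.

```python
def split_evenly(start, stop, parts):
    """Split [start, stop) into `parts` contiguous ranges whose
    sizes differ by at most one iteration."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for p in range(parts):
        hi = lo + base + (1 if p < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def asymmetric_static_split(n_iters, big_cores, little_cores, big_speedup):
    """Assign iterations 0..n_iters-1 to big and LITTLE cores.

    `big_speedup` is a hypothetical estimate of how many times faster
    one big core is than one LITTLE core for this kernel.
    Returns (big_ranges, little_ranges) as lists of (lo, hi) pairs.
    """
    big_weight = big_cores * big_speedup
    little_weight = float(little_cores)
    # Share of the iteration space handed to the big cluster.
    big_iters = round(n_iters * big_weight / (big_weight + little_weight))
    big = split_evenly(0, big_iters, big_cores)
    little = split_evenly(big_iters, n_iters, little_cores)
    return big, little
```

A fixed ratio like this is only as good as the speed estimate behind it; a dynamic strategy, in which idle cores fetch the next chunk of work at run time, removes that dependence at the cost of some scheduling overhead.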

    Solution of Few-Body Coulomb Problems with Latent Matrices on Multicore Processors

    We reformulate a classical numerical method for the solution of systems of linear equations to tackle problems with latent data, that is, linear systems whose dimension is unknown a priori. This type of system appears in the solution of few-body Coulomb problems in atomic simulation physics, in the form of multidimensional partial differential equations (PDEs) that require the numerical solution of a sequence of recurrent dense linear systems of growing scale. The large dimension of these systems, with up to several hundred thousand unknowns, is tackled in our approach via a task-parallel implementation of the solver, using the OmpSs framework.
    Affiliation: Biedma, Luis Ariel. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomía y Física; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina.
    Affiliation: Colavecchia, Flavio Dario. Comisión Nacional de Energía Atómica. Gerencia del Area Investigación y Aplicaciones No Nucleares. Gerencia de Física (Centro Atómico Balseiro). División Colisiones Atómicas; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina.
    Affiliation: Quintana Ortí, Enrique. Universitat Jaume I; Spain.
    International Conference on Computational Science, ICCS 2017. Zurich, Switzerland. ETH Zürich; Universiteit Van Amsterdam; University of Tennessee; Nanyang Technological University.

    A Review of Lightweight Thread Approaches for High Performance Computing

    High-level, directive-based solutions are becoming the programming models (PMs) of multi/many-core architectures. Several solutions relying on operating system (OS) threads work perfectly with a moderate number of cores. However, exascale systems will spawn hundreds of thousands of threads in order to exploit their massively parallel architectures, and conventional OS threads are too heavy for that purpose. Several lightweight thread (LWT) libraries have recently appeared offering lighter mechanisms to tackle massive concurrency. In order to examine the suitability of LWTs in high-level runtimes, we develop a set of microbenchmarks consisting of patterns commonly found in current parallel codes. Moreover, we study the semantics offered by some LWT libraries in order to expose the similarities between different LWT application programming interfaces. This study reveals that a reduced set of LWT functions can be sufficient to cover the common parallel code patterns and that those LWT libraries perform better than OS thread-based solutions in cases where task and nested parallelism are becoming more popular with new architectures.
    The researchers from the Universitat Jaume I de Castelló were supported by project TIN2014-53495-R of the MINECO, the Generalitat Valenciana fellowship programme Vali+d 2015, and FEDER. This work was partially supported by the U.S. Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research (SC-21), under contract DEAC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. Peer reviewed. Postprint (author's final draft).